1 Introduction

We will demonstrate the functionality of pseudoDrift through data simulation, followed by outlier identification, then signal drift correction. To get started, make sure you have the most recent development version from GitHub installed. Throughout this document, QC refers to quality control samples: replicates of the same sample type evaluated within and/or between batches in metabolomics studies. These are typically used to identify and correct technical errors or other anomalies that may occur during the data acquisition phase of an experiment.

# install.packages("remotes")
# remotes::install_github("jrod55/pseudoDrift")

library(pseudoDrift)
library(tidyverse)
library(data.table)

2 Data simulation

To simulate data, we first need an indexed and validated .sdf file from which to pull the metadata associated with the compound of interest. We use a small .sdf file included with this package; to construct your own, simply run the command below. The sdf2index function uses a few ChemmineR R package1 functions to create an indexed file so that the entire .sdf file does not need to be loaded into memory. The .sdf file included in the package was obtained from the MoNA online database.

# sdf2index(sdf_input = "/path_to_your/desired.sdf", out_name = "Index.xls") # Example
# sdf2index(sdf_input = system.file("extdata", "test.sdf", package = "pseudoDrift"), out_name = "Index.xls") # Can run, but not necessary for this vignette

The simulate_data function takes as input a compound name or a db ID (either should be present in the .sdf file) and simulates peak matrices using m/z values for the queried compound. The number of batches nbatch, the number of samples per batch nsamps_per_batch, and the frequency of QC samples QC_freq in each batch can be adjusted by changing the commented lines below. Additional parameters allow the user to adjust the variance, scale, and severity of batch and drift effects. Lastly, changing the seed will generate a new simulation.

sim1 = simulate_data(compound_names = "tricin",
                     xls_file_name = system.file("extdata", "Index.xls", package = "pseudoDrift"),
                     valid_sdf_file = system.file("extdata", "valid-test.sdf", package = "pseudoDrift"),
                     # nbatch = 3,
                     # nsamps_per_batch = c(100, 200, 300),
                     # QC_freq = c(25, 25, 25),
                     # seed = 123
                     )

The chunk of code above generates simulated data for each instance of ‘tricin’ found in the .sdf file. Here we use the default parameters (commented out), which simulate three batches, each with one hundred more samples than the previous batch, and a QC sample included every twenty-five samples.
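As a quick sanity check on those defaults, the expected QC injection positions can be reconstructed in base R. This is an illustration of the documented defaults, not a call into pseudoDrift internals, and it assumes a QC lands at every 25th injection position:

```r
# Expected QC positions under the default simulate_data settings:
# three batches of 100, 200, 300 samples, with a QC every 25th injection.
nsamps_per_batch <- c(100, 200, 300)
QC_freq <- 25
qc_positions <- lapply(nsamps_per_batch, function(n) seq(QC_freq, n, by = QC_freq))
sapply(qc_positions, length)  # 4, 8, 12 QCs in batches 1-3
```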

The sim1 object is a nested tibble which contains all metadata available for the compound, along with the simulated data (sim_mat) and four signal drift and batch effect types: 1) monotonic (t1_sim_mat), 2) batch-to-batch block effect (t2_sim_mat), 3) random (t3_sim_mat), and 4) monotonic coupled with batch-to-batch effect (t4_sim_mat).

Column names of the nested tibble returned by simulate_data

names(sim1)
##  [1] "id"                  "sim_mat"             "t1_sim_mat"         
##  [4] "t2_sim_mat"          "t3_sim_mat"          "t4_sim_mat"         
##  [7] "name_in_tibble"      "name"                "synonyms"           
## [10] "precursor_type"      "spectrum_type"       "precursor_m_z"      
## [13] "instrument_type"     "instrument"          "collision_energy"   
## [16] "ion_mode"            "inchikey"            "formula"            
## [19] "exact_mass"          "mw"                  "contributor"        
## [22] "comment"             "num_peaks"           "mass_spectral_peaks"

Note that columns 2:6 of the tibble above are nested tibbles, which is where the simulation outputs are stored. We can isolate an individual row using the compound id in the first column, then pull the corresponding simulated data using list indexing syntax. In the code below, we isolate the simulated data for the first id: FIO00738.

sim1_sub = sim1 %>% 
  filter(id=="FIO00738")

sim_dat = sim1_sub$sim_mat[[1]]         # Simulated no effects
mono = sim1_sub$t1_sim_mat[[1]]         # Monotonic
b2b = sim1_sub$t2_sim_mat[[1]]          # Batch-to-batch
rando = sim1_sub$t3_sim_mat[[1]]        # Random
mono_b2b = sim1_sub$t4_sim_mat[[1]]     # Monotonic and batch-to-batch

Let’s visualize the simulated signal drift trends, highlighting the QC samples.

# Small plotting function
plt_fun = function(x,dat_lab){
  qc_samps = x %>% filter(sample=="QC")
  plt = ggplot(x, aes(batch_index, area))+
    geom_point(size = 0.75)+
    geom_point(size = 2, data = qc_samps, aes(color=sample))+
    geom_line(data = qc_samps, aes(color=sample))+
    facet_grid(cols = vars(batch), scales = "free")+
    labs(title = paste0(dat_lab), y = "area_simulated")
  return(plt)
}
plt_fun(sim_dat, "Simulated")

plt_fun(mono, "Monotonic")

plt_fun(b2b, "Batch-to-batch")

plt_fun(rando, "Random")

plt_fun(mono_b2b, "Monotonic and batch-to-batch")

For the remainder of the vignette, we’ll focus on the monotonic + batch-to-batch simulated data, primarily because this pattern is the one most typically observed in real data sets.

3 Pairwise outlier identification

It is challenging to identify and remove outliers from any data set, and limited replication further complicates the task. Here we introduce an outlier identification method, pw_outlier, which compares within-sample pairwise differences (among individual sample replicates) to the distribution of all such differences across samples. To illustrate how this procedure works, let’s analyze our simulated data with the pw_outlier function.

# Assign replicates to the simulated data. Note that no technical replicates are included here; we assume only biological replicates.
n_reps = 3
qcs = mono_b2b %>%
  filter(sample=="QC")
tmp = mono_b2b %>%
  filter(!sample=="QC") %>%
  mutate(n_samps = n()/n_reps)
dat_rep = rep(1:n_reps, unique(tmp$n_samps))
dat_sam = paste0("S",rep(1:unique(tmp$n_samps), each = n_reps))

# The monotonic + batch-to-batch simulated data with replicates assigned
mono_b2b_WR = tmp %>%
  mutate(rep = dat_rep,
         sample = dat_sam,
         rep_tech = 1) %>%
  bind_rows(.,qcs) %>%
  arrange(experiment_index)

pw_out = pw_outlier(df = mono_b2b_WR, 
                    return_plot = TRUE, 
                    samps_exclude = "QC")
##  
##  -------------- 
##  Done! 
##  --------------
pw_out$batch_plots
## [[1]]

The plots returned by pw_outlier show, for each batch, the density distribution of all within-sample pairwise differences. For example, for sample S1 with reps 1, 2, 3, the pairwise differences are computed as |1-2|, |1-3|, |2-3|. As expected, most differences are small, which results in a positively skewed distribution. The blue vertical lines and the corresponding values are the threshold difference to be applied in outlier identification per batch.
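The pairwise-difference computation described above can be illustrated in a few lines of base R. The values and the quantile cutoff here are hypothetical, and the threshold choice is our reading of the plot, not pw_outlier internals:

```r
# For one sample with three replicate peak areas, form |1-2|, |1-3|, |2-3|.
reps <- c(1050, 980, 1110)          # hypothetical peak areas for sample S1
pw_diffs <- as.numeric(dist(reps))  # dist() on a 1-D vector gives all |xi - xj|
pw_diffs                            # 70 60 130

# Pooling these differences across every sample in a batch yields the
# positively skewed density shown above; a high quantile of that pooled
# distribution could then serve as a per-batch outlier threshold.
pooled <- c(pw_diffs, 15, 22, 8, 340)   # toy pooled differences for one batch
threshold <- quantile(pooled, 0.95)     # hypothetical cutoff choice
```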

# Make a plot of the data removed by pairwise outlier removal
df_rm = pw_out$df_rm %>%
  mutate(sample = "Outlier")
plt_fun(pw_out$df_cleaned, "Monotonic and batch-to-batch pairwise cleaned")+
  geom_point(data = df_rm, size = 1.25, aes(fill=sample), color = "blue")

4 Signal drift correction

Suppose that QC samples were included in batch 3 alone. This presents issues for QC-based correction methods, which rely on the presence of QCs across all batches to adjust the data. We introduce a method, pseudo_sdc, to estimate what we call ‘pseudoQC’ samples from all non-QC samples in a training batch.

The training batch contains QC samples and is used to find the optimal set of three parameters (test.breaks, test.window, test.index), which are used to estimate pseudoQCs by minimizing a user-set criterion between pseudoQCs and true QCs.

Internally, the pseudo_sdc function fits all possible regression splines to the pseudoQC samples derived from the training batch. These regressions are fit using the Quality Control-Robust Spline Correction (QCRSC) function from the pmp R package2. Finally, the optimal combination of input parameters is used to estimate ‘pseudoQC’ samples and apply the correction for both between-batch and signal drift effects.
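The criteria options that pseudo_sdc can minimize (RSD, RSD_robust, MSE, TSS; described in the commented parameter list further below) can be sketched in base R. These formulas follow the documented descriptions of each option and are our interpretation, not the package internals; all area values are hypothetical:

```r
# Hypothetical pseudoQC areas and matched true QC areas in a training batch
pseudo_qc <- c(1020, 990, 1050, 1005, 975)
true_qc   <- c(1000, 1010, 995, 1030, 980)

rsd        <- sd(pseudo_qc) / mean(pseudo_qc)      # "RSD": sd over mean
rsd_robust <- mad(pseudo_qc) / median(pseudo_qc)   # "RSD_robust": MAD over median
mse <- mean((pseudo_qc - true_qc)^2)               # "MSE": mean squared error
tss <- sum((pseudo_qc - mean(pseudo_qc))^2)        # "TSS": total sum of squares
```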

# Use the pw cleaned data
mono_b2b_cleaned = pw_out$df_cleaned

# A good starting point for parameters to test
train.batch = "B3"
df_param = mono_b2b_cleaned %>%
  filter(!sample=="QC") %>%
  group_by(batch) %>%
  summarise(samps = n_distinct(name))

# Sample size of smallest batch
s_perBatch = min(df_param$samps)

# Test breaks in smallest batch, using the number of QC samples simulated there
mnqc = 25
t.b = round(s_perBatch/mnqc)+1

# Proportional in larger training set
df_param = df_param %>%
  mutate(bre = round(samps * t.b / s_perBatch))
t.b.train = df_param %>%
  filter(batch == train.batch) %>%
  pull(bre)
t.b.train = seq(t.b.train-3, t.b.train+3, 1)

# Test window for median smoothing. Should be a vector of odd numbers
w.n.max = round(min(s_perBatch/t.b))/2
w.n = 2*floor(w.n.max/2)+1
if (w.n>w.n.max) {
  w.n = w.n-2
}
w.n = seq(1,w.n,2)

# Test index offset. Larger values can be tested with larger datasets
ti.max = round(s_perBatch*0.15)
t.i = seq(0,ti.max,1)

# test.breaks = t.b.train     # Test breaking up the training batch into 10 to 12 equally sized sections
# test.window = w.n           # Test taking the median every 1 (reduces to the mean per test.break), to 9 samples. This sequence of numbers should be all odd.
# test.index = t.i            # Test offsetting the injection order of pseudoQC samples estimated
# criteria = "RSD"            # This can be one of: "RSD" (relative standard deviation using standard deviation and mean), "RSD_robust" (relative standard deviation using median absolute deviation and median), "MSE" (mean squared error), or "TSS" (total sum of squares). This is the criterion to be minimized.
# n.cores = 15                # Number of cores to use if your machine has the ability to multi-thread processes.
# quantile_increment = 0.05   # Quantile values (above/below increment) of the data surrounding QCs to use to estimate pseudoQCs.

# To reduce computing time, we are going to use values previously obtained by running the exhaustive set of tests with the inputs commented out above.

# For MSE and TSS criteria:
# test.breaks = 12 ; test.window = 7; test.index = 5; criteria = "MSE"; quantile_increment = 0.05; n.cores = 1

# For RSD:
test.breaks = 13 ; test.window = 9; test.index = 14; criteria = "RSD"; quantile_increment = 0.05; n.cores = 1

sdc_out = pseudo_sdc(df = mono_b2b_cleaned,
                     n.cores = n.cores,
                     train.batch = train.batch,
                     test.breaks = test.breaks,
                     test.window = test.window,
                     test.index = test.index,
                     criteria = criteria,
                     qc.label = "QC",
                     min.qc = min(test.breaks),
                     quantile_increment = quantile_increment)
## [1] "Running...testing 32 combinations of input parameters"
## [1] "The best fit for TRICIN_SIMULATED in training batch B3 is:"
## [1] "With median smoothing every 9 samples to generate pseudo-pools."
## [1] "Index (injection order) offset by 14, and splitting the batch into 13 sections."
## [1] "Using peak area values between quantiles 0.05 and 1"
##  
##  -------------- 
##  Done! 
##  --------------
# slightly larger plotting function
plt_fun1 = function(x,train.batch){
  metab = unique(x$df$compound)
  x1 = x$df_pseudoQC %>% 
    group_by(batch) %>% 
    mutate(index = 1:n(),
           class = factor(class, levels = c("QC", "Pseudo_QC", "Sample")))
  
  qcs1 = x1 %>% 
    filter(class%in%c("QC","Pseudo_QC")) %>% 
    mutate(sample = class)
  
  plt1 = ggplot(x1, aes(index, area))+
    geom_point()+
    geom_point(data = qcs1, aes(color=sample))+
    geom_line(data = qcs1, aes(color=sample))+
    facet_grid(cols = vars(batch), scales = "free")+
    labs(title = paste0(metab," raw data using ",train.batch, " data to train pseudo-QC"))
  
  legend = cowplot::get_legend(plt1)
  
  x2 = x$df_pseudoQC_corrected %>% 
    group_by(batch) %>% 
    mutate(index = 1:n(),
           class = factor(class, levels = c("QC", "Pseudo_QC", "Sample")))
  
  qcs2 = x2 %>% 
    filter(class%in%c("QC","Pseudo_QC")) %>% 
    mutate(sample = class)
  
  plt2 = ggplot(x2, aes(index, area_corrected))+
    geom_point()+
    geom_point(data = qcs2, aes(color=sample))+
    geom_line(data = qcs2, aes(color=sample))+
    facet_grid(cols = vars(batch), scales = "free")+
    labs(title = paste0(metab," pseudo-QC corrected data"))
  
  cplt = cowplot::plot_grid(plt1, plt2, ncol = 1, labels = c("A","B"))
  return(cplt)
}

plt_fun1(sdc_out,train.batch)

The figure above illustrates the raw data in (A) and the signal drift and batch corrected data using pseudo_sdc (B).

5 Comparing correction with originally simulated data

In the previous section we saw how well pseudoQCs correct the data with respect to QC samples alone. Let’s see how well the correction performs for non-QC samples, using the originally simulated data.

cor_dat = sdc_out$df_pseudoQC_corrected %>%
  select(name,area_corrected, class) %>%
  left_join(., sim_dat) %>%
  drop_na(area)
## Joining, by = "name"
cor_dat %>% 
  ggplot(., aes(area_corrected, area, color = batch))+
  geom_point(alpha=0.5)+
  labs(x = "area_corrected_with_pseudoQC",
       y = "original_area_simulated_without_any_effects")

The figure shows that across batches, the correction with pseudoQCs does a relatively good job of restoring the data to the originally simulated values. Let’s get some concrete metrics for how well the modeling performed.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
fc = trainControl(method = "cv", number = 10)
fit = train(area ~ area_corrected, 
             data = cor_dat, 
             method = "lm", 
             trControl = fc)
fit
## Linear Regression 
## 
## 576 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 518, 517, 520, 520, 517, 517, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3509.917  0.7709064  2738.267
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
par(mfrow = c(2, 2))
plot(fit$finalModel)

Finally, let’s determine how pseudoQC compares to a true QC correction using the pmp R package2.

## A short function to format the data the way pmp expects it
m_qcrsc = function(x, y){
  x = x %>% 
    mutate(class = ifelse(sample %in% y, "QC", "Sample"))
  t_meta = colnames(x)
  t_meta = t_meta[!t_meta%in%c("name", "compound")]
  tqc = x %>%
    pivot_wider(id_cols = !all_of(t_meta), names_from = name, values_from = area)
  t_rn = tqc$compound
  tqc = as.matrix(tqc[,-1])
  rownames(tqc) = t_rn
  mc = pmp::QCRSC(df=tqc,
                  minQC = 4,
                  order=seq_along(x$experiment_index),
                  batch=x$batch,
                  classes=x$class,
                  qc_label = paste0(y))
  z = x %>%
    mutate(area_corrected = mc[1,], .before = area)
  return(z)
  }

## Run the function, then conduct the same diagnostic tests performed with pseudoQC above.
true_QC = m_qcrsc(mono_b2b_cleaned, "QC")
## The number of NA and <= 0 values in peaksData before QC-RSC: 0
cor_dat_true_QC = true_QC %>%
  select(name,area_corrected, class) %>%
  left_join(., sim_dat) %>%
  drop_na(area)
## Joining, by = "name"
cor_dat_true_QC %>% 
  ggplot(., aes(area_corrected, area, color = batch))+
  geom_point(alpha=0.5)+
  labs(x = "area_corrected_with_true_QC",
       y = "original_area_simulated_without_any_effects")

fit_true_QC = train(area ~ area_corrected, 
             data = cor_dat_true_QC, 
             method = "lm", 
             trControl = fc)
fit_true_QC
## Linear Regression 
## 
## 576 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 519, 518, 518, 518, 518, 519, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4004.659  0.6969839  3253.792
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
par(mfrow = c(2, 2))
plot(fit_true_QC$finalModel)

6 Session information

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] caret_6.0-88      lattice_0.20-44   data.table_1.14.0 pseudoDrift_1.0.0
##  [5] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.7       purrr_0.3.4      
##  [9] readr_2.0.1       tidyr_1.1.3       tibble_3.1.6      ggplot2_3.3.5    
## [13] tidyverse_1.3.1   BiocStyle_2.20.2 
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                backports_1.2.1            
##   [3] plyr_1.8.6                  splines_4.1.1              
##   [5] listenv_0.8.0               usethis_2.0.1              
##   [7] ChemmineR_3.44.0            GenomeInfoDb_1.28.2        
##   [9] digest_0.6.28               foreach_1.5.1              
##  [11] htmltools_0.5.2             magick_2.7.3               
##  [13] fansi_0.5.0                 magrittr_2.0.1             
##  [15] memoise_2.0.0               tzdb_0.1.2                 
##  [17] remotes_2.4.0               globals_0.14.0             
##  [19] recipes_0.1.16              gower_0.2.2                
##  [21] modelr_0.1.8                matrixStats_0.60.1         
##  [23] vroom_1.5.4                 prettyunits_1.1.1          
##  [25] colorspace_2.0-2            rvest_1.0.1                
##  [27] haven_2.4.3                 xfun_0.25                  
##  [29] callr_3.7.0                 crayon_1.4.2               
##  [31] RCurl_1.98-1.4              jsonlite_1.7.2             
##  [33] missForest_1.4              impute_1.66.0              
##  [35] survival_3.2-13             zoo_1.8-9                  
##  [37] iterators_1.0.13            glue_1.5.1                 
##  [39] gtable_0.3.0                ipred_0.9-11               
##  [41] zlibbioc_1.38.0             XVector_0.32.0             
##  [43] DelayedArray_0.18.0         pkgbuild_1.2.0             
##  [45] future.apply_1.8.1          BiocGenerics_0.38.0        
##  [47] scales_1.1.1                DBI_1.1.1                  
##  [49] Rcpp_1.0.7                  bit_4.0.4                  
##  [51] lava_1.6.10                 prodlim_2019.11.13         
##  [53] stats4_4.1.1                rsvg_2.1.2                 
##  [55] DT_0.19                     htmlwidgets_1.5.3          
##  [57] httr_1.4.2                  ellipsis_0.3.2             
##  [59] pkgconfig_2.0.3             farver_2.1.0               
##  [61] nnet_7.3-16                 dbplyr_2.1.1               
##  [63] utf8_1.2.2                  janitor_2.1.0              
##  [65] tidyselect_1.1.1            labeling_0.4.2             
##  [67] rlang_0.4.12                reshape2_1.4.4             
##  [69] munsell_0.5.0               cellranger_1.1.0           
##  [71] tools_4.1.1                 cachem_1.0.6               
##  [73] cli_3.1.0                   generics_0.1.0             
##  [75] devtools_2.4.2              broom_0.7.9                
##  [77] evaluate_0.14               fastmap_1.1.0              
##  [79] yaml_2.2.1                  ModelMetrics_1.2.2.2       
##  [81] processx_3.5.2              knitr_1.33                 
##  [83] bit64_4.0.5                 fs_1.5.0                   
##  [85] randomForest_4.6-14         future_1.22.1              
##  [87] nlme_3.1-152                pmp_1.4.0                  
##  [89] xml2_1.3.2                  compiler_4.1.1             
##  [91] rstudioapi_0.13             png_0.1-7                  
##  [93] testthat_3.0.4              reprex_2.0.1               
##  [95] stringi_1.7.4               highr_0.9                  
##  [97] ps_1.6.0                    desc_1.3.0                 
##  [99] Matrix_1.3-4                vctrs_0.3.8                
## [101] pillar_1.6.4                lifecycle_1.0.1            
## [103] BiocManager_1.30.16         cowplot_1.1.1              
## [105] bitops_1.0-7                GenomicRanges_1.44.0       
## [107] R6_2.5.1                    pcaMethods_1.84.0          
## [109] bookdown_0.24               gridExtra_2.3              
## [111] parallelly_1.27.0           IRanges_2.26.0             
## [113] sessioninfo_1.1.1           codetools_0.2-18           
## [115] MASS_7.3-54                 assertthat_0.2.1           
## [117] pkgload_1.2.1               SummarizedExperiment_1.22.0
## [119] rprojroot_2.0.2             rjson_0.2.20               
## [121] withr_2.4.3                 S4Vectors_0.30.0           
## [123] GenomeInfoDbData_1.2.6      parallel_4.1.1             
## [125] hms_1.1.0                   rpart_4.1-15               
## [127] grid_4.1.1                  timeDate_3043.102          
## [129] class_7.3-19                rmarkdown_2.10             
## [131] snakecase_0.11.0            MatrixGenerics_1.4.3       
## [133] pROC_1.18.0                 itertools_0.1-3            
## [135] Biobase_2.52.0              lubridate_1.7.10           
## [137] base64enc_0.1-3
(1) Cao, Y. E.; Horan, K.; Backman, T.; Girke, T. ChemmineR: Cheminformatics Toolkit for R; 2021.
(2) Jankevics, A.; Lloyd, G. R.; Weber, R. J. M. pmp: Peak Matrix Processing and Signal Batch Correction for Metabolomics Datasets; 2021.